Functional Anomaly Detection

Author

Daniel Krasnov

Published

November 18, 2025

Introduction

This document showcases my master’s thesis work on functional anomaly detection. Specifically, I created a Bayesian nonparametric method for detecting multivariate functional anomalies. The method combines:

  • Wavelet decompositions for mean function representation.
  • Dirichlet Process (DP) mixture models for automatic determination of the number of clusters.
  • A Carlin–Chib step for automatic covariance kernel selection.
  • Spike-and-slab priors in the wavelet domain for smoothing.

In general, what distinguishes anomaly detection from clustering is class imbalance. We assume each dataset is 15% anomalies and provide no prior information on the number or type of anomalies. To give the model some sense of what “normal” looks like, we also reveal 15% of the normal class labels, making this a semi-supervised learning problem.

To showcase model performance, I analyze two datasets from the Time Series Machine Learning Website:

  • Asphalt Regularity: Road surface regularity measurements
  • Character Trajectories: Handwritten character trajectories

Mathematical Formulation

First, I give a brief overview of the model before showcasing the applications. For a full introduction to the model, please refer to my thesis, which will be published soon.

Model Structure

Let \(\mathbf{Y} = \{Y_1, \ldots, Y_N\}\) be a collection of \(N\) functional observations, where each \(Y_i: \mathcal{T} \to \mathbb{R}^M\) is a multivariate function defined on a time domain \(\mathcal{T}\).

The model assumes that each function \(Y_i\) belongs to one of \(K\) clusters (where \(K\) is potentially infinite), and functions within the same cluster share similar characteristics. The model can be written as:

\[Y_i(t) = \mu_{z_i}(t) + \epsilon_i(t), \quad i = 1, \ldots, N\]

where:

  • \(z_i \in \{1, 2, \ldots\}\) is the cluster assignment for observation \(i\)
  • \(\mu_k(t)\) is the mean function for cluster \(k\)
  • \(\epsilon_i(t)\) is a zero-mean error process
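To make the additive structure concrete, here is a minimal simulation sketch in Julia. The mean functions, cluster count, and noise level are purely illustrative, and the i.i.d. Gaussian noise stands in for the correlated error process defined later.

Code
using Random

# Toy draw from the additive model: two clusters with different mean
# functions plus noise. For simplicity the noise here is i.i.d. Gaussian;
# in the actual model ε_i(t) is a correlated multi-output GP (see below).
Random.seed!(1)
tgrid = range(0, 1, length=64)
μ1(t) = sin(2π * t)                                   # cluster 1 mean
μ2(t) = sin(2π * t) + 0.8 * exp(-50 * (t - 0.5)^2)    # cluster 2: local bump
means = (μ1, μ2)
z = rand(1:2, 20)                                     # cluster assignments z_i
Y = [means[z[i]].(tgrid) .+ 0.1 .* randn(length(tgrid)) for i in 1:20]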

Dirichlet Process Prior

The cluster assignments follow a Dirichlet Process (DP) prior, which allows for an infinite number of clusters:

\[z_i \mid \boldsymbol{\pi} \sim \text{Categorical}(\boldsymbol{\pi})\]

where \(\boldsymbol{\pi} = (\pi_1, \pi_2, \ldots)\) are the cluster probabilities. The DP is constructed using the stick-breaking representation:

\[\pi_k = v_k \prod_{j=1}^{k-1}(1 - v_j), \quad v_k \sim \text{Beta}(1, \alpha)\]

Here, \(\alpha > 0\) is the concentration parameter controlling the number of clusters. To avoid arbitrarily truncating the infinite sum, we use Walker’s slice sampling method (Walker 2007).
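As a quick illustration, here is a truncated stick-breaking draw in Julia. The truncation level is for display only; as noted above, the actual sampler avoids truncation via Walker’s slice sampler.

Code
using Distributions, Random

# Illustrative truncated stick-breaking draw. This is only for intuition:
# the sampler itself uses Walker's slice sampler, so no truncation level
# K is ever fixed in the model.
function stick_breaking(alpha::Real, K::Int; rng=Random.default_rng())
    v = rand(rng, Beta(1, alpha), K)   # stick proportions v_k ~ Beta(1, α)
    probs = similar(v)
    remaining = 1.0                    # mass left on the stick
    for k in 1:K
        probs[k] = v[k] * remaining
        remaining *= 1 - v[k]
    end
    return probs
end

probs = stick_breaking(2.0, 25)
sum(probs)   # < 1; the gap is the truncated tail mass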

Wavelet Representation

Each mean function \(\mu_k(t)\) is represented in a wavelet basis. Consider performing a discrete wavelet transform on our data. Then, for each dimension \(m\), we have \[ \mu_{k,m}(t) = \sum_{\ell=1}^{L} \alpha_{k,m,\ell}\,\phi_{\ell}(t) + \sum_{j=1}^{J}\sum_{\ell=1}^{L_j}\beta_{k,m,j,\ell}\,\psi_{j,\ell}(t), \]

where \(\phi_\ell(t)\) are scaling functions and \(\psi_{j,\ell}(t)\) are wavelet functions with scale \(j\) and position \(\ell\) (Ray and Mallick 2006).
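For intuition, the following sketch runs a discrete wavelet transform on a toy signal. It assumes the Wavelets.jl package and its sym8 filter (the same family used as a fallback in the code below); it is not part of the WICMAD implementation itself.

Code
using Wavelets

# Toy 1-D signal on a dyadic grid (the periodic DWT used here expects a
# length that is a power of two).
t = range(0, 1, length=64)
y = sin.(2π .* t) .+ 0.1 .* randn(64)

wt = wavelet(WT.sym8)        # Symlet-8 filter
coefs = dwt(y, wt)           # stacked scaling + detail coefficients
y_rec = idwt(coefs, wt)      # inverse transform recovers the signal
maximum(abs.(y .- y_rec))    # ≈ 0 up to floating-point error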

Spike-Slab Prior

Each detail coefficient \(\beta_{k,m,j,\ell}\) follows a spike-and-slab prior:

\[ \beta_{k,m,j,\ell} \mid \gamma_{k,m,j,\ell} \sim (1-\gamma_{k,m,j,\ell})\,\delta_0 + \gamma_{k,m,j,\ell}\,\mathcal{N}\!\big(0,\,\sigma_{k,m}^2g_{k,j}\big), \qquad \gamma_{k,m,j,\ell} \sim \mathrm{Bernoulli}(\pi_{k,j}). \]

Here \(\gamma_{k,m,j,\ell}\) indicates whether a coefficient is active, \(\sigma_{k,m}^2\) is the per-dimension variability, and \(g_{k,j}\) is the per-level variation. The inclusion probability \(\pi_{k,j}\) controls the expected sparsity of level \(j\) and is given a Beta prior: \[ \pi_{k,j} \sim \mathrm{Beta}\!\big(\tau_\pi m_j,\;\tau_\pi (1-m_j)\big), \qquad m_j = \kappa_\pi\, 2^{-c_2 j}. \]

Under this parameterization,

\[ \mathbb{E}[\pi_{k,j}] = m_j = \kappa_\pi\, 2^{-c_2 j}, \]

so that finer scales have smaller expected inclusion probabilities (Ray and Mallick 2006).
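A small sketch of drawing one level’s coefficients from this prior, with illustrative hyperparameter values rather than those used in the analyses:

Code
using Distributions, Random

# Sketch: draw detail coefficients at level j from the spike-and-slab prior.
# kappa_pi, c2, tau_pi, sigma2, and g are illustrative values only.
function draw_level(j::Int, n_coefs::Int; kappa_pi=0.8, c2=1.0,
                    tau_pi=10.0, sigma2=1.0, g=1.0, rng=Random.default_rng())
    m_j = kappa_pi * 2.0^(-c2 * j)                    # expected inclusion prob.
    pi_j = rand(rng, Beta(tau_pi * m_j, tau_pi * (1 - m_j)))
    gamma = rand(rng, Bernoulli(pi_j), n_coefs)       # spike/slab indicators
    beta = [γ ? rand(rng, Normal(0, sqrt(sigma2 * g))) : 0.0 for γ in gamma]
    return beta
end

# Finer scales (larger j) are sparser on average.
count(!iszero, draw_level(1, 32)), count(!iszero, draw_level(4, 32))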

Kernel Selection

Let \(s_c\in\{1,\dots,R\}\) denote the kernel–family indicator for cluster \(c\), which corresponds to one covariance family from

\[ \mathcal K = \Big\{ k^{(\mathrm{SE})},\; k^{(\mathrm{Mat}\,3/2)},\; k^{(\mathrm{Mat}\,5/2)},\; k^{(\mathrm{Per})},\; k^{(\mathrm{RQ})},\; k^{(\mathrm{PowExp})},\; k^{(\mathrm{GibbsPC3})},\; k^{(\mathrm{CP\text{-}M52})} \Big\}. \]

These denote the squared exponential, Matérn with \(\nu=3/2\), Matérn with \(\nu=5/2\), periodic, rational quadratic, powered exponential, Gibbs with piecewise–constant lengthscale in three bins, and changepoint Matérn–\(5/2\) kernels respectively. Each family \(r\) has its own hyperparameters \(\phi_{c,r}\), and all kernels are normalized to have unit marginal variance for identifiability.

Kernel switching is carried out with the Carlin–Chib method (Carlin and Chib 1995). For each cluster \(c\) and each kernel family \(r\), we define a parameter vector \(\phi_{c,r}\) and refresh it from its pseudoprior when that kernel is inactive. The pseudopriors are chosen to match the true priors so that the ratio \(p_r/q_r\) cancels. Only the active pair \((s_c,\phi_{c,s_c})\) contributes to the likelihood, and the indicator \(s_c\) is sampled from a categorical distribution with weights

\[ w_r \propto \mathcal{N}\!\Big( \operatorname{vec}(Y_{c,\mathrm{res}})\mid 0,\, \tau_{B,c}\,B_c^{\text{shape}}\otimes K_x^{(r)}(\phi_{c,r}) + \mathrm{diag}(\eta_c)\otimes I_P \Big), \qquad \pi(s_c{=}r)=1/R. \]
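The toy sketch below illustrates the indicator update with just two candidate families and fixed hyperparameters; the actual sampler cycles through all eight families, their hyperparameter updates, and the full ICM covariance described next.

Code
using Distributions, LinearAlgebra, Random

# Toy Carlin–Chib indicator update for a univariate cluster: score the
# residual vector under each candidate kernel's marginal likelihood and
# sample s_c from the normalized weights.
se(a, b; ℓ=0.2) = exp(-(a - b)^2 / (2ℓ^2))
mat32(a, b; ℓ=0.2) = (d = abs(a - b) / ℓ; (1 + √3 * d) * exp(-√3 * d))

function sample_kernel(y::Vector{Float64}, t::AbstractVector; η=0.1,
                       rng=Random.default_rng())
    fams = [se, mat32]
    logw = [logpdf(MvNormal(zeros(length(t)),
                            Symmetric([k(a, b) for a in t, b in t] + η * I)), y)
            for k in fams]
    w = exp.(logw .- maximum(logw))   # stabilize before normalizing
    w ./= sum(w)
    rand(rng, Categorical(w))         # sampled kernel-family index
end

tgrid = range(0, 1, length=20)
sample_kernel(randn(20), tgrid)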

Intrinsic Coregionalization Model (ICM)

To accommodate multivariate functions, we represent the residual process within each cluster as a multi-output Gaussian process. Let \(Y_{c,\mathrm{res}}\in\mathbb{R}^{P\times M}\) denote the residuals for cluster \(c\), where \(P\) is the number of time points and \(M\) the number of dimensions. Vectorizing columnwise, the model assumes

\[ \operatorname{vec}(Y_{c,\mathrm{res}}) \sim \mathcal{N}\!\left(0,\; B_c \otimes K_x^{(s_c)} \,+\, \mathrm{diag}(\eta_c) \otimes I_P\right), \]

where \(K_x^{(s_c)}\) is the \(P\times P\) covariance matrix corresponding to the Carlin–Chib selected kernel family \(s_c\), \(B_c\) is an \(M\times M\) positive semi-definite coregionalization matrix, and \(\eta_c\in\mathbb{R}_{+}^M\) contains dimension-specific noise variances. This is known as the Intrinsic Coregionalization Model (ICM) (Alvarez et al. 2012).
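As a sketch, the ICM covariance can be assembled directly with Kronecker products; the sizes, kernel, and noise levels below are illustrative only.

Code
using Distributions, LinearAlgebra

# Sketch: simulate one cluster's residual matrix from the ICM covariance
# B ⊗ K + diag(η) ⊗ I_P under the column-wise vec convention.
P, M = 32, 3
tgrid = range(0, 1, length=P)
K = [exp(-(a - b)^2 / (2 * 0.1^2)) for a in tgrid, b in tgrid]  # SE kernel
L = LowerTriangular(randn(M, M))
B = L * L'
B ./= tr(B)                                   # unit-trace coregionalization
η = fill(0.05, M)                             # dimension-specific noise

Σ = kron(B, K) + kron(Diagonal(η), I(P))      # (MP)×(MP) ICM covariance
y = rand(MvNormal(zeros(P * M), Symmetric(Σ)))
Y_res = reshape(y, P, M)                      # back to P×M residual matrix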

With this we may now write the full hierarchical model:

\[
\begin{align*}
\alpha &\sim \mathrm{Gamma}(a_\alpha,b_\alpha) \\
v_k \mid \alpha &\sim \mathrm{Beta}(1,\alpha), \qquad \pi_k = v_k \prod_{j<k} (1-v_j) \\
z_i \mid \{\pi_k\} &\sim \mathrm{Categorical}(\{\pi_k\}) \\
\theta_k &= \Big( \{\beta_{k,m,j,\ell}\},\; \{\gamma_{k,m,j,\ell}\},\; \{g_{k,j}\},\; \{\pi_{k,j}\},\; \{\sigma_{k,m}^2\},\; \tau_{\sigma,k},\; \tau_{B,k},\; B_k^{\text{shape}},\; \eta_k,\; s_k,\; \{\phi_{k,r}\}_{r=1}^R \Big) \\
\pi_{k,j} &\sim \mathrm{Beta}(\tau_\pi m_j,\,\tau_\pi(1-m_j)), \qquad m_j=\kappa_\pi\,2^{-c_2 j} \\
\gamma_{k,m,j,\ell}\mid \pi_{k,j} &\sim \mathrm{Bernoulli}(\pi_{k,j}) \\
\beta_{k,m,j,\ell}\mid \gamma_{k,m,j,\ell}, g_{k,j}, \sigma_{k,m}^2 &\sim (1-\gamma_{k,m,j,\ell})\,\delta_0 + \gamma_{k,m,j,\ell}\,\mathcal N(0,\,g_{k,j}\sigma_{k,m}^2) \\
g_{k,j} &\sim \mathrm{Inv\text{-}Gamma}(a_g,b_g), \qquad \sigma_{k,m}^2 \mid \tau_{\sigma,k} \sim \mathrm{Inv\text{-}Gamma}(a_\sigma,\, b_\sigma \tau_{\sigma,k}) \\
\tau_{\sigma,k} &\sim \mathrm{Gamma}(a_\tau,b_\tau) \\
B_k^{\text{shape}} &= \frac{L_k L_k^\top}{\mathrm{tr}(L_k L_k^\top)}, \qquad (L_k)_{ab} \sim \mathcal N(0,1)\ (a\ge b), \qquad \eta_{k,m} \sim \mathrm{Inv\text{-}Gamma}(a_\eta,b_\eta) \\
\phi_{k,r} &\sim p_r(\cdot),\quad r=1,\dots,R \\
s_k &\sim \mathrm{Categorical}\!\big(\tfrac{1}{R},\dots,\tfrac{1}{R}\big) \\
w_{k,r} &\propto \tfrac{1}{R}\; \mathcal N\!\Big( \mathrm{vec}(Y_{k,\mathrm{res}})\mid 0,\; \tau_{B,k} B_k^{\text{shape}}\!\otimes\! K_x^{(r)}(\phi_{k,r}) + \mathrm{diag}(\eta_k)\!\otimes\! I_P \Big)
\end{align*}
\]

Separately, the wavelet family is chosen before running the sampler, using the revealed normal curves, by minimizing a selection loss \(\mathcal{L}(w)\) that rewards sparse representations through a term \(\alpha_{\mathrm{spr}}\bigl(1 - \mathrm{Sparsity}(w)\bigr)\); the selected wavelet is

\[ w^{\ast} = \arg\min_{w \in \mathcal{W}} \mathcal{L}(w). \]

Dataset 1: Asphalt Regularity

Code
# Reveal 15% of normals for wavelet selection
normal_indices = findall(==(normal_class), y_full)
n_revealed = max(1, round(Int, 0.15 * length(normal_indices)))
revealed_subset = sample(normal_indices, n_revealed, replace=false)
revealed_idx = sort(revealed_subset)

# Convert to vector format for wavelet selection
Y_vec = [X[i, :] for i in 1:size(X, 1)]
Y_mats = [reshape(Float64.(Yi), :, 1) for Yi in Y_vec]

sel = WICMAD.KernelSelection.select_wavelet(Y_mats, t, revealed_idx;
            wf_candidates = nothing, J = nothing, boundary = "periodic",
            mcmc = (n_iter=3000, burnin=1000, thin=1),
            verbose = false)
selected_wf = sel.selected_wf;
Code
Y_raw = [reshape(X[i, :], :, 1) for i in 1:size(X, 1)]

# Reset seed for reproducibility
Random.seed!(42)

res_asphalt = wicmad(Y_raw, t;
    n_iter=5000,
    burn=2000,
    thin=1,
    wf=selected_wf,
    diagnostics=true,
    bootstrap_runs=0,
    revealed_idx=revealed_idx,
    verbose=false,
)

# Compute MAP estimate
mapr_asphalt = map_from_res(res_asphalt)
z_final_asphalt = mapr_asphalt.z_hat
ari_asphalt = WICMAD.adj_rand_index(z_final_asphalt, z_true)

# Evaluate performance
pred_bin_asphalt = to_binary_labels(z_final_asphalt);
Code
# Create side-by-side before/after comparison plot
p_asphalt = WICMAD.Plotting.plot_clustering_comparison(
    Y_raw, gt_binary, pred_bin_asphalt, t;
    save_path=nothing
)
plot!(p_asphalt, plot_title="Asphalt Regularity", plot_titlefontsize=16)
p_asphalt
Code
confusion_matrix("Asphalt Regularity", gt_binary, pred_bin_asphalt)
------------------------------------------------------------
Confusion matrix — Asphalt Regularity
 (rows = true class, cols = predicted class; normal=0, anomaly=1)

          Pred=0    Pred=1
True=0    542       217
True=1    4         130

N = 893   Accuracy = 0.753   Precision = 0.375   Recall = 0.970   F1 = 0.541
------------------------------------------------------------
(tp = 130, tn = 542, fp = 217, fn = 4, accuracy = 0.7525195968645016, precision = 0.3746397694524496, recall = 0.9701492537313433, f1 = 0.5405405405405406)
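As a sanity check, the reported metrics follow directly from the confusion counts:

Code
# Recompute the reported metrics from the confusion counts above.
tp, tn, fp, fn = 130, 542, 217, 4
accuracy  = (tp + tn) / (tp + tn + fp + fn)          # 0.7525
precision = tp / (tp + fp)                           # 0.3746
recall    = tp / (tp + fn)                           # 0.9701
f1 = 2 * precision * recall / (precision + recall)   # 0.5405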

Dataset 2: Character Trajectories

The Character Trajectories dataset is a multivariate functional dataset containing 2858 character samples. The three dimensions recorded are the x coordinate, y coordinate, and pen tip force. The data come pre-smoothed and were captured at 200 Hz. There is a class for each letter of the alphabet. For our example, we took the letter “a” to be the normal class and the letter “b” to be the anomalous class. All other classes were discarded.

Code
# Reset seed for reproducibility (before data sampling)
Random.seed!(42)

# Ensure data_dir is available
data_dir = joinpath(dirname(@__DIR__), "data")

ds_char = WICMAD.Utils.load_ucr_dataset("CharacterTrajectories/CharacterTrajectories", data_dir)

# Select normal and anomaly classes
normal_class_char = "1"  # Corresponds to 'a'
cm_char = countmap(ds_char.labels)
classes_sorted_char = sort(collect(keys(cm_char)); by = c -> -cm_char[c])
other_classes = filter(c -> c != normal_class_char, classes_sorted_char)
anomaly_class_char = isempty(other_classes) ? error("No other classes found") : other_classes[1]

normal_indices_full_char = findall(==(normal_class_char), ds_char.labels)
anomaly_indices_full_char = findall(==(anomaly_class_char), ds_char.labels)

# Target anomaly proportion: 15%
p_anom = 0.15
p_normal = 0.85

avail_norm_char = length(normal_indices_full_char)
avail_anom_char = length(anomaly_indices_full_char)

n_normal_used_char = avail_norm_char
n_anomaly_needed = round(Int, n_normal_used_char * p_anom / p_normal)
n_anomaly_used_char = min(avail_anom_char, n_anomaly_needed)

if n_anomaly_used_char < n_anomaly_needed
    n_normal_used_char = min(avail_norm_char, round(Int, n_anomaly_used_char * p_normal / p_anom))
end

used_normal_indices_char = sample(normal_indices_full_char, n_normal_used_char, replace=false)
used_anomaly_indices_char = sample(anomaly_indices_full_char, n_anomaly_used_char, replace=false)
all_used_indices_char = vcat(used_normal_indices_char, used_anomaly_indices_char)
shuffle!(all_used_indices_char)

# Subset data
Y_full_char = ds_char.series[all_used_indices_char]
y_full_char = [ds_char.labels[i] for i in all_used_indices_char]

# Interpolate
Y_interp_char, P_char = interpolate_to_length(Y_full_char; target_len=32)
t_char = collect(1:P_char)

# Ground-truth labels
gt_binary_char = [yi == normal_class_char ? 0 : 1 for yi in y_full_char]
z_true_char = relabel_consecutive(y_full_char);
Code
# Wavelet selection
normal_indices_char = findall(==(normal_class_char), y_full_char)
n_revealed_char = max(1, round(Int, 0.15 * length(normal_indices_char)))
revealed_subset_char = sample(normal_indices_char, n_revealed_char, replace=false)
revealed_idx_char = sort(revealed_subset_char)

if !isempty(revealed_idx_char)
    try
        sel_char = WICMAD.KernelSelection.select_wavelet(Y_interp_char, t_char, revealed_idx_char;
            wf_candidates = nothing, J = nothing, boundary = "periodic",
            mcmc = (n_iter=3000, burnin=1000, thin=1),
            verbose = false)
        selected_wf_char = sel_char.selected_wf
    catch
        selected_wf_char = "sym8"
    end
else
    selected_wf_char = "sym8"
end

# Run WICMAD
# Reset seed for reproducibility
Random.seed!(42)

res_char = wicmad(Y_interp_char, t_char;
    n_iter=5000,
    burn=2000,
    thin=1,
    wf=selected_wf_char,
    diagnostics=true,
    bootstrap_runs=0,
    revealed_idx=revealed_idx_char,
    verbose=false,
)

# Compute MAP estimate
mapr_char = map_from_res(res_char)
z_final_char = mapr_char.z_hat
ari_char = WICMAD.adj_rand_index(z_final_char, z_true_char)

# Evaluate performance
pred_bin_char = to_binary_labels(z_final_char);
Code
# Create side-by-side before/after comparison plot
p_char = WICMAD.Plotting.plot_clustering_comparison(
    Y_interp_char, gt_binary_char, pred_bin_char, t_char;
    save_path=nothing
)
plot!(p_char, plot_title="Character Trajectories", plot_titlefontsize=16)
p_char
Code
confusion_matrix("Character Trajectories", gt_binary_char, pred_bin_char)
------------------------------------------------------------
Confusion matrix — Character Trajectories
 (rows = true class, cols = predicted class; normal=0, anomaly=1)

          Pred=0    Pred=1
True=0    171       0
True=1    0         30

N = 201   Accuracy = 1.000   Precision = 1.000   Recall = 1.000   F1 = 1.000
------------------------------------------------------------
(tp = 30, tn = 171, fp = 0, fn = 0, accuracy = 1.0, precision = 1.0, recall = 1.0, f1 = 1.0)

Discussion

The above examples demonstrate that the model generalizes well to real-world data. We achieve high recall at the cost of some precision, resulting in a competitive F1 score for an anomaly detection model.

The Character Trajectories dataset represents ideal data for our model: smooth sensor data that are mostly stationary. Unsurprisingly, our model achieves perfect classification performance across all metrics. This is also expected because these data contain shape anomalies, which our model detected easily in our simulated experiments.

The Asphalt Regularity data proved more challenging. Here, the model attained high recall but weaker precision. These data are far less smooth than the previous dataset, so the Gaussian process assumption is less appropriate. However, we showcased this example because we wished to investigate our model’s performance on less ideal data. We identified nearly all anomalous observations at the cost of some false positives. We find this trade-off acceptable since, in most anomaly detection settings, a false positive is less detrimental than a missed anomaly.

References

Alvarez, Mauricio A., Lorenzo Rosasco, and Neil D. Lawrence. 2012. “Kernels for Vector-Valued Functions: A Review.” Foundations and Trends in Machine Learning 4 (3): 195–266.
Carlin, Bradley P., and Siddhartha Chib. 1995. “Bayesian Model Choice via Markov Chain Monte Carlo Methods.” Journal of the Royal Statistical Society Series B: Statistical Methodology 57 (3): 473–84.
Ray, Shubhankar, and Bani Mallick. 2006. “Functional Clustering by Bayesian Wavelet Methods.” Journal of the Royal Statistical Society Series B: Statistical Methodology 68 (2): 305–32. https://doi.org/10.1111/j.1467-9868.2006.00545.x.
Walker, Stephen G. 2007. “Sampling the Dirichlet Mixture Model with Slices.” Communications in Statistics - Simulation and Computation 36 (1): 45–54. https://doi.org/10.1080/03610910601096262.